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Abstract 



We introduce the concept of PAM, Programmable Active Memory and present results obtained 
with our Perle-0 prototype board, featuring: 

• A software silicon foundry for a 50K gate array, with a 50 milliseconds turn-around 
time. 

• A 3000 one bit processors universal machine with an arbitrary interconnect structure 
specified by 400K bits of nano-code. 

• A programmable hardware co-processor with an initial library including: a long 
multiplier, an image convolver, a data compressor, etc. Each of these hardware designs 
speeds up the corresponding software application by at least an order of magnitude. 



Resume 



Nous presentons le concept de memoire active programmable (PAM) et decrivons diverses 
experiences et mesures faites sur notre prototype Perle-0 qui est : 

• Une fonderie logicielle de silicium, pour prediffuses de 50K portes, dont le temps de 
rotation est de 50 millisecondes. 

• Une machine universelle comportant 3000 processeurs un bit. La structure de connexion 
entre les processeurs est arbitraire et sa programmation specifiee par 400K bits de 
nano-code. 

• Un co-processeur materiel programmable dont la bibliotheque initiale comprend : un 
multiplicateur 512 bits, un convolueur d'images noir et blanc, un compresseur de 
donnees, etc. Chacune de ces configurations accelere 1' application logicielle correspon- 
dante de plus d'un ordre de grandeur. 
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1 Concept 

Definition (Programmable Active Memory: PAM) 

A PAM is a uniform array of identical cells all connected in the same repetitive fashion. Each 
cell, called a PAB for programmable active bit, must be general enough so that the following 
holds true: Any synchronous digital circuit can be realized (through suitable p r0 gr" mm """ ] 
on a large enough PAM for a slow enough clock. 
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Figure 1 : A simple PAM 

Figure 1 shows a simple PAM implementation as a regular matrix of Manhattan connected 
identical PABs. Each PAB has: 

• 4 inputs < n,s,e,w>; 

• 4 outputs < N,S,E, W>; 

• 1 register (flip-flop) with input R and output r, synchronous with the PAM's global clock 

elk; 

• 1 combinatorial gate g connected so that: g(n, s, e, w, r) = < N,S,E, W,R>; 

• 160 = 5 X 32 control bits which specify the truth table of function g. 

A program for such a PAM with m active bits is a sequence of 160m control bits (bitstream) 
representing the truth tables for each PAB. This program can be downloaded into the control 
part of the PAM. From this instant, and until the program is changed again, our PAM behaves 
as the particular finite state machine specified by the bitstream. 



Other PAM implementations are possible: 
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• the Configurable Array Logic (CAL) reported in [3] is close to the one above; 

• the Logic Cell Array (LCA) is a cMOS component designed by Xilinx Inc. [7]. Except 
for minor details, it is similar to the above description with each of our < n, s, e, w > and 
< N, S, E, W > variables carrying 4 signals instead of one. The truth table of function g 
is reduced, for obvious practical reasons: only one general 5 input boolean function is 
actually specified; the remaining part of g is used to specify various switchboxes, usable 
to route the signals. 

Our investigation aims at answering the following questions: 

• How to build large PAMs? 

• How fast can it run? 

• Which class of design can we efficiently map onto a PAM? 

• How do we implement PAM designs? 

2 Implementation 

Using a 5 X 5 array of LCA chips, we have built a 40 X 80 ~ 3K bit PAM named Perle-0 
on a 25 x 25 cm 2 printed circuit board. Perle-0 has a VME bus interface, which makes it a 
general-purpose configurable hardware co-processor tightly coupled to a host CPU (Figure 3). 

The configuration data for Perle-0 (about 400K bits) is downloaded by the host itself in 
50 milliseconds. The logic controlling the download process, as well as the host bus 
communication protocol are programmed into two extra LCAs, statically configured at power- 
up time from a PROM. By merely changing the content of that PROM, we are able to quickly 
adapt to different bus protocols, or to add extra features to the bus interface. 

The PAM cycle being much faster than that of the host bus, we added 4 Megabits of fast 
static RAM to Perle-0, directly connected to the PAM (the bandwidth of that memory is up 
to 1.5 Gigabits per second, while the host bus bandwidth is typically around 50 Megabits per 
second). Besides a few mandatory buffers for driving the host bus, Perle-0 is thus built out of 
two kinds of components: LCA and static RAM (Figure 2). 

PAM designs are synchronous logic circuits, each of the registers being updated on each cycle 
of a global clock signal. The maximal clock speed for such a design is directly determined 
by its critical combinatorial path, which varies from one design to another. It has thus been 
necessary to design a clock distribution system whose speed can be programmed as part of 
the design configuration, for speeds up to 70 MHz (the present maximum clock cycle of LCA 
chips). 

Last but not least, we take advantage of an extra feature of the LCA component which makes 
it possible to dynamically read back the content of each PAB. Together with a "software 
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Figure 2: Perle-0 



stepping" facility (stop the main clock and trigger clock cycles one at a time from the host) 
this provides a powerful debugging tool which can take a snapshot of the internal state of the 
design after each clock cycle. 



3 PAM programming 

The logical description of a PAM design consists of the truth table for each PAB and the 
description of their interconnection. In order to derive the final configuration bitstream for 
each LCA, we also need to specify the placement of each block within the PAM, as well as the 
actual routing of each signal through the available channels and switching boxes. 

At the LCA level, both placement and routing may be automatically generated by Xilinx 
software. We have found this to be practical only for routing; automatic placement is far too 
time-consuming and it generally yields poor results: in this technology, each programmable 
interconnection point adds about 2 nanoseconds to the signal propagation delay. Therefore, a 
poor relative block placement may considerably affect the final design performance. 

In our CAD system, the designer needs to specify the logic equations and net names for 
each PAB; optionally he may also control the actual placement of each PAB. This is achieved 
through a program in a high-level language (we use Modula-2-i- and Lisp) or through schematic 
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entries. Netlists are then generated for each of the 25 LCAs of Perle-0 and we use Xilinx 
software to finish the placement, do the actual routing and produce the bitstream for each LCA. 
Finally, the 25 bitstreams are merged into a 400K bits long file ready to be downloaded by the 
host into the PAM. 

The above steps typically take from 15 minutes to a few hours, depending on the complexity of 
the design; subsequent minor modifications are generally much faster. The actual downloading 
is then performed in 50 ms. Since debugging the designs can be done through stepping nr >^ 
readback, this effectively brings hardware design times very close to these of software, 
particular, it provides a vehicule for experimenting with hardware algorithms which would 
otherwise be too time-consuming and costly to implement. 



• Programmed Host Bus 
Interface. 

• Configuration from Host 
in 50 ms. 

• Design loop from 2mn to 
2hrs. 

• Dynamic Read Back for 
Active Bits. 



Figure 3: Perle-0 setup 



4 Design experiments 

We describe here some of the designs which have been successfully tested on Perle-0. More 
designs are currently in progress. 

4.1 Early experiments 

Our first designs aim at debugging and testing the performance of Perle-0 itself: 

• A software-controlled variable clock generator, used as a basic block in building other 
designs. 

• A combinatorial link established through all the pins, in order to test the printed 
circuit board traces for electrical continuity; it also implements one of the worst-case 
combinatorial delay achievable through routing, and runs at 4 MHz. 
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• A 3K shift-register through all of the PAM's active bits. It provides an example of the 
fastest possible synchronous design and proves operational at clock speeds up to 50 
MHz. 

• Designs to read and write Perle-0's RAM from the host. Their speed is limited by the 
VME bus bandwidth, at about 50 Mbits/s. 

Such simple designs typically take a few hours to build, compile and debug. 



• B/W 25MHz Image Convolution 
(B.Chen). 

• 8 registers for 1bx512x512 
images. 

• Host RAM <-> PAM in 12ms. 

• 7x7 convolution in 10ms. 

• Sequence of 4 erode/dilate at 
video rate (25 im/sec). 

• Faster than most 
commercial products. 



• Ziv Lempel Data Compression 
(M. Skubiszewski). 

• Match 4Kpast bytes 
against 16 future bytes. 

•2to5 compression rate. 

• 64 comparisons each 70ns: 

- 1000 MIPS? 
-250 MIPS? 



Figure 4: Two PAM designs 

4.2 Long multiplication 

We have programmed Perle-0 into a fast 512-bit multiplier (see [4]) computing 

p = A x B + S where: 

- A is a m bit integer, with m < 512; 

- B is a n bit integer, with n > 0 arbitrary; 

- S is a n + m bit integer; 

- P is a n + m + 1 bit integer. 
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The host writes the A, B and S operands into the PAM, and reads back the resulting P. 
The speed-up factor achieved by this hardware co-processor can be greater than 25 for long 
operands, compared to carefully hand-coded software. It computes a IK bit product in 210 
microseconds, which makes this PAM design faster than any known machine for which we 
could obtain benchmark measures. It is 4 times faster than a Cyber 170/750. 

The multiplier is interfaced with an arbitrary-precision arithmetic package [6] so that any 
program based on that software takes advantage of the PAM without modification. For 
example, Hans Boehm has run computations for exact real arithmetic [1] with typical speedups 
from 3 to 7. We have written code for modular products and used it for RSA cryptography [5]. 
The PAM computes RSA encryption/decryption at 3900 bits per second for arbitrary 512-bit 
keys; this is about 4.5 times faster than our best software version on the same host. 

The first operational version of the multiplier was developed in less than a week. Two 
subsequent versions, which refined the design on the basis of actual performance measurements, 
were each developed in less than 5 man-days. 

4.3 Data compression 

A hardware accelerator for data compression was implemented in four months by Marcin 
Skubiszewski, who had no previous hardware experience. Based on an algorithm described 
in [2], it is a massively parallel method (64 byte comparisons each 70 ns), which repetitively 
matches the next 16 bytes in the file to be compressed against the last 4k bytes seen (stored in 
the local RAM) in order to detect the longest substring previously seen (see [8]). The resulting 
design achieves an average compression ratio varying from 2 for English text to 3 for source 
code. 

4.4 Image processing 

J. Bradley Chen has implemented the operations of erosion and dilation (two-dimensional 
boolean convolution) on 512 x 512 x 1-bit images, with 7 x 7-bit masks. The design runs at 
25 MHz, computing one bit of the transformed image on each cycle, for a total of 10 ms per 
image. Our best software solution for that problem is at least two orders of magnitude slower. 
This design was implemented in about 3 months of work, once again with minimal previous 
hardware experience. 

5 Conclusion 

PAM is an exciting new technology, with a lot of growth potential. Based on our first year of 
experiments, we derive the following conclusions: 

Prototyping. While custom hardware prototyping times are at best measured in months, 
prototyping with PAM is achieved in weeks. Design times for PAM must be compared 
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with those of software, not of hardware. 

Speed. Assuming equal technology, custom hardware designs are faster than their PAM 
implementation; yet the speed loss appears to be less than 50% for systolic designs with 
short connections. 

Size. While custom hardware designs are obviously smaller than their PAM counterpart, the 
same PAM can be used to run any number of hardware designs ! We regard the PAM as a 
hardware cache: it can be programmed to perform arbitrary tasks, at speeds comparable 
to those of custom hardware. 

Size vs. Speed. All our designs are such that we can double the bandwidth on a twice bigger 
PAM. This provides a strong incentive to build much larger ones. 

Architecture. As a general purpose computing device, Perle-0 exhibits some unique features: 

• The bitstream is a program for Perle-0 which is thus the ultimate one-instruction 
RISC machine (400K wide nano-code!). 

• The PAM is a parallel machine with 3000 one-bit processors. The interconnect 
structure between these processors can be arbitrary, since it is specified by the 
program. Exotic interconnect structures (hypercube and such) are feasible, but 
they use up more PAM area and run with slower clocks. 

• The host and PAM system is a silicon foundry with a 50 ms turn-around time. The 
host can also read back the content of each active bit in the PAM, providing an 
invaluable design independent debugging tool. 

• The PAM is a flexible structure which can be tailored (logic, memory and 
interconnect) to the efficient implementation of any well designed massively 
parallel bit-level algorithm. This should be contrasted with the usual situation 
where the algorithm must be chosen so as to adapt to the structure of the 
implementation machine. 
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